support SD3 #1374

Draft: wants to merge 313 commits into base: dev
Conversation

@kohya-ss (Owner) commented Jun 15, 2024

  • Replace SD3Tokenizer with the original CLIP-L/G/T5 tokenizers.
  • Extend the max token length to 256 for T5XXL.
  • Refactor caching for latents.
  • Refactor caching for Text Encoder outputs.
  • Extract architecture-dependent parts from datasets.
  • Refactor SD/SDXL training scripts.
  • Cache attention masks etc.
  • Enable training of CLIP-L/G for SD3.
  • Add an option to use T5XXL from transformers (for the fp8-quantized version).
  • Add attention mask for T5XXL embeds (?); see the sketch after this list. https://www.reddit.com/r/StableDiffusion/comments/1e6k59c/solution_discovered_partially_implemented_for_sd3/
  • Sample images during training.
  • Cache Text Encoder outputs for sampling.
  • Update SD/SDXL sampling to use refactored Text Encoding etc.
  • Update gen_img.py to use refactored Text Encoding etc.
  • SD3 LoRA support.
  • SD3.5 support.
  • FLUX.1 fine tuning.
  • FLUX.1 LoRA support for the FLUX transformer.
  • FLUX.1 LoRA support for CLIP-L.
  • FLUX.1 attention masking.
  • FLUX.1 sample image generation during training.
  • Update cache_latents.py and cache_text_encoder_outputs.py to support FLUX.1.
  • Support .json metadata for FLUX.1 and SD3.
  • Add a captioning script using Florence-2 and/or JoyCaption.
  • Support prior preservation loss.
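
For the T5XXL attention mask item above, here is a minimal illustrative sketch of the idea from the linked post, assuming a transformers T5EncoderModel and its tokenizer (this is not the implementation in this PR; the model name and helper are placeholders): zero out the output embeddings at padded positions so that padding tokens do not contribute to attention.

    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    # placeholders for illustration; the actual scripts load T5XXL from a local checkpoint
    tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    t5xxl = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    def get_masked_t5_embeds(prompt: str, max_length: int = 256) -> torch.Tensor:
        tokens = tokenizer(
            prompt,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            t5_out = t5xxl(
                input_ids=tokens["input_ids"],
                attention_mask=tokens["attention_mask"],
            ).last_hidden_state
        # zero the embeddings at padded positions so they cannot leak into cross-attention
        return t5_out * tokens["attention_mask"].unsqueeze(-1).to(t5_out.dtype)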

@bghira commented Jun 16, 2024

this is a chance to just use Diffusers modules instead of doing everything from scratch. why not take it?

@kohya-ss (Owner, Author)

There are several reasons for this, but the biggest is that it is difficult to extend: for example, for LoRA, custom ControlNet, Deep Shrink, etc.

Also, considering the various processes in the training scripts, such as conditional loss, SNR, masked loss, etc., the training scripts need to be written from scratch.

@bghira commented Jun 16, 2024

all of that is done via peft other than deepshrink but you can make a pipeline callback for that.

@bghira commented Jun 16, 2024

i mean to use the sd3 transformer module from the diffusers project.

it is frustrating to see bespoke versions of things with unreadable comments always in this repository. can you at least leave better comments?

@kohya-ss (Owner, Author)

I think the transformer module should be extendable in the future. In addition, the SD3 transformer is based on sd3-ref (Stability AI's official repo) and was modified by KBlueLeaf to support xformers etc., so it predates Diffusers and is not written from scratch. I appreciate your understanding.

I will add better comments in future code, including in this PR.

@araleza commented Jul 10, 2024

Hello, I have been trying out SD3 training. It seems to be working pretty well. 😊

One thing I noticed is that generation of sample images while training is not yet implemented. This made it hard for me to see how my SD3 training was going, and make adjustments.

Implementing full support for all the sample images was difficult, but I found a cheap way to get most features working, and now I have sample images working again. This code is not properly integrated with the usual sample image generation code, but if people want to use it while they wait for a real well-integrated implementation, it does the basics of what's needed.

Just go into your sd3_train.py file, and find this commented-out section:

                # sdxl_train_util.sample_images(
                #     accelerator,
                #     args,
                #     None,
                #     global_step,
                #     accelerator.device,
                #     vae,
                #     [tokenizer1, tokenizer2],
                #     [text_encoder1, text_encoder2],
                #     mmdit,
                # )

and replace that with this:

                # Generate sample images
                if args.sample_every_n_steps is not None and global_step % args.sample_every_n_steps == 0:
                    from sd3_minimal_inference import do_sample
                    from PIL import Image
                    import datetime
                    import numpy as np
                    import shlex
                    import random

                    assert args.save_t5xxl, "When generating sample images in SD3, --save_t5xxl parameter must be set"

                    with open(args.sample_prompts, 'r') as file:
                        lines = [line.strip() for line in file if line.strip()]

                    vae.to("cuda")
                    for line in lines:
                        logger.info(f"Generating image: {line}")

                        # split the caption text from any trailing --w/--h/--s/--l/--d options
                        if line.find('--') != -1:
                            prompt = line[:line.find('--')].strip()
                            line = line[line.find('--'):]
                        else:
                            prompt = line
                            line = ''

                        parser_s = argparse.ArgumentParser()
                        parser_s.add_argument("--w", type=int, action="store", default=1024, help="image width")
                        parser_s.add_argument("--h", type=int, action="store", default=1024, help="image height")
                        parser_s.add_argument("--s", type=int, action="store", default=30,   help="sample steps")
                        parser_s.add_argument("--l", type=int, action="store", default=4,    help="CFG")
                        parser_s.add_argument("--d", type=int, action="store", default=random.randint(0, 2**32 - 1), help="seed")
                        prompt_args = shlex.split(line)
                        args_s = parser_s.parse_args(prompt_args)

                        # prepare embeddings
                        lg_out, t5_out, pooled = sd3_utils.get_cond(prompt, sd3_tokenizer, clip_l, clip_g, t5xxl) # +'ve prompt
                        cond = torch.cat([lg_out, t5_out], dim=-2), pooled

                        lg_out, t5_out, pooled = sd3_utils.get_cond("", sd3_tokenizer, clip_l, clip_g, t5xxl) # No -'ve prompt
                        neg_cond = torch.cat([lg_out, t5_out], dim=-2), pooled

                        latent_sampled = do_sample(
                            args_s.h, args_s.w, None, args_s.d, cond, neg_cond, mmdit, args_s.s, args_s.l, weight_dtype, accelerator.device
                        )

                        # latent to image
                        with torch.no_grad():
                            image = vae.decode(latent_sampled)
                        image = image.float()
                        image = torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)[0]
                        decoded_np = 255.0 * np.moveaxis(image.cpu().numpy(), 0, 2)
                        decoded_np = decoded_np.astype(np.uint8)
                        out_image = Image.fromarray(decoded_np)

                        # save image
                        output_dir = os.path.join(args.output_dir, "sample")
                        os.makedirs(output_dir, exist_ok=True)
                        output_path = os.path.join(output_dir, f"{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.png")
                        out_image.save(output_path)

                    vae.to("cpu")

It supports a caption followed by the usual optional --w, --h, --s, --l, --d (for width, height, steps, cfg, and seed). It doesn't support negative captions, and it won't work right with captions longer than 75 tokens.
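
For example, a line in the sample prompts file might look like this (the caption is made up; the options map to width, height, steps, CFG, and seed):

    a photo of a mountain lake at sunrise --w 1024 --h 1024 --s 30 --l 4 --d 12345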

I'm finding sample image generation to be helpful. For example, I noticed that most of my sample output images start off looking brighter than expected (with white or bright backgrounds). Edit: that might have been my CFG of 7.5; SD3 seems to want lower CFG values. I had to push the sample step count up as the CFG was lowered. Image quality still seems poor, though, compared to what some people are getting out of SD3.

@araleza commented Jul 10, 2024

Think I've found an issue that's causing the poor quality SD3 samples. The do_sample() function is not filling in the shift parameter that's required by SD3, and it's defaulting to 1.0 instead of the recommended 3.0:

class ModelSamplingDiscreteFlow:
    """Helper for sampler scheduling (ie timestep/sigma calculations) for Discrete Flow models"""

    def __init__(self, shift=1.0):
        self.shift = shift
        timesteps = 1000
        self.sigmas = self.sigma(torch.arange(1, timesteps + 1, 1))

From sd-scripts' sd3_minimal_inference.py, in do_sample():

    model_sampling = sd3_utils.ModelSamplingDiscreteFlow()

From the SD3 paper:
[image: excerpt from the SD3 paper on the timestep/sigma shift]

The paper also seems to say that these shifts to the sigmas should be present during training. Are these maybe missing too, @kohya-ss? (Edit: No, a shift value of 3.0 is already set up correctly during training)
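
For reference, a minimal fix along these lines (assuming the constructor shown above) is simply to pass the recommended shift instead of relying on the default:

    # pass the recommended shift of 3.0 explicitly instead of the 1.0 default
    model_sampling = sd3_utils.ModelSamplingDiscreteFlow(shift=3.0)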

@kohya-ss (Owner, Author)

Think I've found an issue that's causing the poor quality SD3 samples. The do_sample() function is not filling in the shift parameter that's required by SD3, and it's defaulting to 1.0 instead of the recommended 3.0:

Thank you! I fixed it. The generated images seem to be better now.

@kohya-ss (Owner, Author)

I agree that sample image generation is really useful. In my understanding, T5XXL is on the CPU, so I wonder whether get_cond may take a long time. How much time does it take?

I think it might be necessary to get the TE outputs for the sampling prompts in advance, at the same time as the TE caching. However, if T5XXL works on the CPU in an acceptable time, the implementation of sample generation will be much easier (like your implementation :) ).
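
For illustration, caching the conditioning for the sample prompts at TE-caching time could look roughly like this (a sketch only; get_cond is the helper used in the snippet above, and sample_prompts stands for the lines read from --sample_prompts):

    # sketch: pre-compute conditioning for each sample prompt while the Text Encoders
    # are still loaded, so T5XXL is not needed again at sampling time
    sample_prompt_cache = {}
    for prompt in sample_prompts:
        lg_out, t5_out, pooled = sd3_utils.get_cond(prompt, sd3_tokenizer, clip_l, clip_g, t5xxl)
        sample_prompt_cache[prompt] = (torch.cat([lg_out, t5_out], dim=-2), pooled)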

@bghira commented Jul 11, 2024

it takes about 30-50 seconds to run T5 XL on the CPU, i think XXL is even worse latency for each embed

@araleza commented Jul 11, 2024

I agree that sample image generation is really useful. In my understanding, T5XXL is on the CPU, so I wonder whether get_cond may take a long time. How much time does it take?

@kohya-ss, the calls to get_cond() only take around 2 seconds each on my machine. The whole sample image generation takes just 16 seconds per image for me, and I am still doing 80 sample steps for the images. :D

My PC is an ordinary (but good) home PC machine with a 13th gen Intel i7, and I've got 64 GB of CPU RAM. Perhaps the people finding the T5 XL to be very slow are running out of CPU memory and swapping the T5 XL out to disk without realizing? @bghira

@kohya-ss (Owner, Author)

Thank you @bghira and @araleza! I tested with T5XXL on the GPU, and it takes less than 2 seconds as araleza wrote, but it seems to require about 32GB of additional main RAM... So practically, it may be necessary to cache the TE outputs for the sample prompts.

@bghira commented Jul 11, 2024

yes, and the text encoder being trained will cause the problem. but maybe they shouldn't be trained 🤷

@bghira commented Jul 11, 2024

@araleza lol i'm running on 8x H100 system with more than 1.6TB of system memory and high-end EPYC

@araleza commented Jul 11, 2024

Hello! So far, I've had to run SD3 training in full_bf16 mode, because I ran out of (24GB) VRAM if I did not choose this option. I've now found a way to run training in full fp32.

This line of code pushes VRAM usage much higher:

            mmdit = accelerator.prepare(mmdit)

and then shortly after, the VRAM usage drops when the T5XXL is moved from GPU to CPU:

            t5xxl.to("cpu", dtype=torch.float32)

If I change these two code sections around, the peak VRAM usage is much lower. So that's swapping the section that starts with if args.cache_text_encoder_outputs: and the section that starts with if args.deepspeed:.

It looks like training in fp32 mode may improve quality significantly, although it is slower.

I may be using more VRAM than needed because I'm not using DeepSpeed yet. (I tried to use it, but I got an error message, and I haven't looked into its cause yet.) But these code sections are still worth swapping to avoid the VRAM spike for people who are not using it.

(Edit: I'm not sure Deepspeed even works with SD3 just now, so maybe everyone with 24GB is currently running out of VRAM when trying to use fp32?)
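
For clarity, the reordering amounts to something like this (a sketch, not the exact code in sd3_train.py):

    # move the Text Encoders off the GPU first...
    clip_l.to("cpu", dtype=torch.float32)
    clip_g.to("cpu", dtype=torch.float32)
    t5xxl.to("cpu", dtype=torch.float32)
    torch.cuda.empty_cache()

    # ...and only then let accelerate wrap MMDiT, so the two memory peaks do not overlap
    mmdit = accelerator.prepare(mmdit)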

@FurkanGozukara

reducing peak vram is so important

@kohya-ss (Owner, Author)

If I change these two code sections around, the peak VRAM usage is much lower. So that's swapping the section that starts with if args.cache_text_encoder_outputs: and the section that starts with if args.deepspeed:.

That's right. When we cache the Text Encoder outputs (which is necessary for now), the Text Encoders can be moved to the CPU before preparing MMDiT. I updated the script.

In my environment, SD3 training works with mixed precision on 24GB, with the AdaFactor optimizer and gradient checkpointing. I believe it will not work without mixed precision.

@araleza commented Jul 12, 2024

In my environment, SD3 training works with mixed precision on 24GB, with the AdaFactor optimizer and gradient checkpointing. I believe it will not work without mixed precision.

Huh, this is confusing... 🤔 I've now been training SD3 with full fp32 for everything over the last day, and it all works great. The quality with fp32 is amazing compared to bf16. I'm also using Adafactor with the same optimizer settings as I usually use for sdxl (i.e. 'scale_parameter=False relative_step=False warmup_init=False').

I'm even using a batch size of 4, and I still have VRAM to spare:

|    0   N/A  N/A      8186      C   .../Dev/sd3/sd-scripts/venv/bin/python      20202MiB |

I don't have the --mixed_precision=bf16 or the --full_bf16 flags set (or any fp16 flags either).

@bghira commented Jul 12, 2024

yes, but how high can the batch size go with mixed precision? 4 is very low, bordering on useless?

@araleza commented Jul 12, 2024

I've been using batch size 4 for a long time - it seems pretty good for fine tuning to add a concept. Maybe people who want to do continued pretraining would want to use a higher batch size, but they'll have more VRAM than 24 GB.

I've just tried batch size 6 (with full fp32) now, and it works as well. Batch size 8 ran out of VRAM after around 50 steps.

@kohya-ss, here are my command-line parameters in case they're useful to you:

--pretrained_model_name_or_path="/home/ara/Dev/training/earthscape/kohya/dreambooth/at-step00008900.safetensors" --clip_l="/home/ara/Dev/sd3/clip_l.safetensors" --clip_g="/home/ara/Dev/sd3/clip_g.safetensors" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --train_data_dir="/home/ara/Dev/training/earthscape/kohya/img" --resolution="1024,1024" --output_dir="/home/ara/Dev/training/earthscape/kohya/dreambooth" --caption_extension=".txt" --logging_dir="/home/ara/Dev/training/earthscape/kohya/log" --save_model_as=safetensors --lr_scheduler_num_cycles="20000" --max_data_loader_n_workers="0" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="100" --max_train_steps="160000" --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="100" --save_last_n_steps="200" --gradient_checkpointing --sdpa --bucket_no_upscale --sample_sampler=k_dpm_2 --sample_prompts="/home/ara/Dev/training/earthscape/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="100" --cache_latents --loss_type=huber --train_batch_size="4" --enable_wildcard --alpha_mask --cache_text_encoder_outputs  --cache_latents --cache_latents_to_disk --learning_rate=4e-7 --save_t5xxl

@kohya-ss (Owner, Author)

@araleza Thank you! I've tested without mixed precision (--mixed_precision no for accelerate, and removing the --mixed_precision option for sd3_train), and it works!

Surprisingly, with batch_size=1, fp32 training seems to use less memory than bf16. I am wondering if there might be something wrong with the model and will investigate.

@FurkanGozukara

Currently, are we able to train the CLIP text encoders? The lower ones, not T5.

I am guessing that training the model + CLIP text encoders would yield better results, but I haven't tried it yet; I am still waiting for the main branch merge.

@araleza commented Jul 13, 2024

@FurkanGozukara, training the original text encoders (clip_l and clip_g) is not currently supported. That's because SD3 training currently requires caching of the text encoder outputs at the start of training, which means the text encoder weights cannot then be updated.

The reason for the forced caching is that running the T5XXL text encoder takes several seconds, assuming you can even fit it in CPU RAM. Kohya mentioned it takes 32 GB, which is more than some people have. Even if it fits in memory, it takes around 2 seconds to convert the captions into embeddings, which would have to happen on every training step.

The obvious solution would be to cache the T5XXL outputs only, allowing clip_l and clip_g to train. But code to cache just one of the three text encoders and not the other two has not been written yet.
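
As an illustration only (none of this exists in the scripts yet; encode_t5 and encode_clip are hypothetical helpers), partial caching might look something like this:

    # cache only the T5XXL outputs once, up front (no gradients needed here)
    t5_cache = {caption: encode_t5(caption) for caption in all_captions}

    # during training, run CLIP-L/G live so their weights can still be updated
    for batch in dataloader:
        lg_out, pooled = encode_clip(batch["captions"], clip_l, clip_g)    # trainable path
        t5_out = torch.stack([t5_cache[c] for c in batch["captions"]])     # cached, frozen
        cond = torch.cat([lg_out, t5_out], dim=-2), pooled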

@FurkanGozukara

@araleza thanks a lot for the detailed explanation. I hope that code comes soon.

@sdbds (Contributor) commented Jul 13, 2024

[image: training sample comparison]

TEST RESULTS:
FP32 VS MixFP16

FP32 is much better and there's no difference in VRAM usage.
It seems like SDXL doesn't have that noticeable of a difference.

@bghira commented Jul 13, 2024

fp32 should definitely use more vram than full bf16 / fp16 and if it isn't, there might be something wrong.

@bghira commented Jul 20, 2024

make sure to apply the t5 attention mask in the attention processor @kohya-ss

@kohya-ss (Owner, Author) commented Dec 2, 2024

FLUX.1 ControlNet training has been merged in #1813. Thank you minux302 for the contribution!

Currently, 80GB VRAM is needed for 1024x1024 training. As soon as I have time, I will try using block swap for ControlNet to see if I can reduce the required VRAM.

@FurkanGozukara

@kohya-ss amazing

any sample dataset and dataset toml file that we can take a look at for controlnet?

@kohya-ss (Owner, Author) commented Dec 2, 2024

flux_train_control_net.py now supports --blocks_to_swap. It should run with 16 or 24GB VRAM.

@kohya-ss (Owner, Author) commented Dec 2, 2024

any sample dataset and dataset toml file that we can take a look at for controlnet?

We can use the fill50k dataset from the original ControlNet: https://huggingface.co/lllyasviel/ControlNet/tree/main/training

Place the captions as .txt files in the target directory. The dataset config is something like this:

[general]
resolution = [1024, 1024]

[[datasets]]
batch_size = 1
enable_bucket = false

  [[datasets.subsets]]
  image_dir = "/path/to/fill50k/target"
  caption_extension = ".txt"
  conditioning_data_dir = "/path/to/fill50k/source"

I'll write more details when I have time.
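
For example, the expected layout (paths are placeholders) would be roughly:

    fill50k/
      source/19123.png   # conditioning image
      target/19123.png   # training target image
      target/19123.txt   # caption, e.g. "sandy brown circle with moccasin background"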

@FurkanGozukara

@kohya-ss thanks

now I looked at that dataset

e.g.

[image: example image from the fill50k dataset]

this is both source and target

{"source": "source/19123.png", "target": "target/19123.png", "prompt": "sandy brown circle with moccasin background"}

if you can elaborate more that would be amazing thank you

@kohya-ss (Owner, Author) commented Dec 4, 2024

target/19123.png should be like this:
[image: target/19123.png, a sandy brown circle on a moccasin background]

I think this is an appropriate image for the caption "sandy brown circle with moccasin background".

@FurkanGozukara

@kohya-ss thank you so much I understand now

so this is training a canny controlnet right?

@kohya-ss (Owner, Author) commented Dec 4, 2024

so this is training a canny controlnet right?

This is just a toy dataset, it can only control the position of the circle with a condition image, and the color of the circle and background with a prompt...

@FurkanGozukara

so this is training a canny controlnet right?

This is just a toy dataset, it can only control the position of the circle with a condition image, and the color of the circle and background with a prompt...

so are there any real ControlNet datasets that we can take a look at? thank you

kohya-ss and others added 4 commits December 7, 2024 15:12
Workflow tests fixes and documentation
* Update sd3_train.py

* add freeze block lr

* Update train_util.py

* update

* Revert "add freeze block lr"

This reverts commit 8b16535.

# Conflicts:
#	library/train_util.py
#	sd3_train.py

* use same control net model path

* use controlnet_model_name_or_path
@kohya-ss (Owner, Author) commented Dec 7, 2024

The option to specify the model name for an existing ControlNet model has been unified across the ControlNet training scripts. Please specify --controlnet_model_name_or_path. Thanks to sdbds!

@kohya-ss (Owner, Author)

RAdamScheduleFree optimizer is now supported. Please update schedulefree to 1.4.

@FurkanGozukara

any chance we could get pivotal tuning? https://huggingface.co/blog/sdxl_lora_advanced_script#pivotal-tuning

@araleza commented Dec 18, 2024

I just wanted to report that after using it for a few days, the new schedule-free RAdam optimizer seems very strong. I thought it might not be that good as it just seemed to offer a better warmup (so better for a few iterations and then just the same as normal after that), but it seems to produce improved quality results even after that.

You can activate it with --optimizer_type radamschedulefree on the command line. Despite the documentation recommending a default learning rate of 2.5e-3, I'm finding 1e-5 to 3e-5 to be more appropriate, at least for training a LoRA on some of my datasets.
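
For example, based on the values above, the relevant fragment of a training command would be something like (the learning rate is just the range I found useful, not an official recommendation):

    --optimizer_type radamschedulefree --learning_rate 1e-5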

@bghira commented Dec 19, 2024

i think every new option added has somebody claim it's the best or strongest choice lol
